
Add bulk apis for pipeline status #25731

Merged
ulixius9 merged 18 commits into main from batch_pipeline_status on Feb 10, 2026
Conversation

@harshach (Collaborator) commented Feb 6, 2026

Add bulk apis for pipeline status

Describe your changes:

Fixes

I worked on ... because ...


Summary by Gitar

  • New bulk API endpoint:
    • PUT /{fqn}/status/bulk accepts a list of PipelineStatus objects, reducing API calls from N to 1 per pipeline
  • Backend optimization:
    • JDBI batch operations with MySQL/PostgreSQL upsert support in PipelineRepository.bulkUpsertPipelineStatuses
    • Bulk Elasticsearch indexing via SearchRepository.bulkIndexPipelineExecutions
  • Databricks connector enhancement:
    • yield_pipeline_status now aggregates runs within configurable numberOfStatus window (default: 1 day)
  • Python SDK additions:
    • OMetaBulkPipelineStatus model and add_bulk_pipeline_status method in OMetaPipelineMixin (see the usage sketch after this list)
  • Integration tests:
    • put_bulkPipelineStatus_200_OK validates 5 status submissions with overlap handling
    • put_bulkPipelineStatus_invalidTask_4xx verifies task validation
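
A minimal Python sketch of how the new SDK method might be used, based only on the names above; the exact signature of add_bulk_pipeline_status and the role of OMetaBulkPipelineStatus are not shown on this page, so treat the parameters as assumptions:

from metadata.generated.schema.entity.data.pipeline import PipelineStatus, StatusType
from metadata.ingestion.ometa.ometa_api import OpenMetadata


def publish_bulk_status(metadata: OpenMetadata, pipeline_fqn: str, run_timestamps: list[int]) -> None:
    """Send one bulk request (PUT /{fqn}/status/bulk) instead of one call per run."""
    statuses = [
        # Epoch-millisecond timestamps; a real ingestion would set executionStatus per run.
        PipelineStatus(executionStatus=StatusType.Successful, timestamp=ts)
        for ts in run_timestamps
    ]
    # Assumed signature: the PR adds add_bulk_pipeline_status to OMetaPipelineMixin,
    # but its exact parameters are not documented in this summary.
    metadata.add_bulk_pipeline_status(fqn=pipeline_fqn, statuses=statuses)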

This will update automatically on new commits.


Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

github-actions bot commented Feb 6, 2026

TypeScript types have been updated based on the JSON schema changes in the PR

@github-actions github-actions bot requested a review from a team as a code owner February 6, 2026 07:54
Comment on lines +268 to +269
if run.start_time and run.start_time < cutoff_ts:
    break

⚠️ Edge Case: Databricks connector break assumes descending run order

The yield_pipeline_status method uses break when it encounters a run with start_time < cutoff_ts. This assumes that self.client.get_job_runs() returns runs in strictly descending chronological order (newest first).

If the Databricks API returns runs in a different order (e.g., ascending, or unordered), this break will cause the method to skip recent runs that appear after an older one in the list. This would result in incomplete pipeline status data being ingested.

Suggestion: Either:

  1. Verify and document that get_job_runs() guarantees descending order (Databricks Jobs API does return runs in descending order by default when using list_runs)
  2. Use continue instead of break to safely handle all runs regardless of ordering, though this loses the early-termination optimization
  3. Add a sort before iterating: sorted(runs, key=lambda r: r.get('start_time', 0), reverse=True) (a fuller sketch follows below)
if run.start_time and run.start_time < cutoff_ts:
    continue  # safer: skip old runs without assuming order
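
For completeness, a hedged sketch of option 3 that keeps the early-exit optimization; it assumes each run object exposes start_time in epoch milliseconds (possibly None), mirroring the attribute access in the snippet above rather than the exact connector code:

def iter_runs_within_window(runs, cutoff_ts):
    """Yield only runs newer than cutoff_ts, without trusting the API's ordering."""
    # Sort defensively (newest first) so the break below is always safe.
    for run in sorted(runs, key=lambda r: r.start_time or 0, reverse=True):
        if run.start_time and run.start_time < cutoff_ts:
            break  # everything after this point is older than the lookback window
        yield run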


github-actions bot commented Feb 6, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (18)

Package Vulnerability ID Severity Installed Version Fixed Version
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5, 5.29.6
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
python-multipart CVE-2026-24486 🚨 HIGH 0.0.20 0.0.22
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

github-actions bot commented Feb 6, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (8)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

ulixius9 previously approved these changes Feb 6, 2026
gitar-bot bot commented Feb 9, 2026

🔍 CI failure analysis for 4a073fa: Maven SonarCloud CI job failed with 5 Data Product test failures unrelated to this PR's pipeline status API changes.

Issue

Maven SonarCloud CI: 5 Data Product test failures + 1 test error

Root Cause

Unrelated to PR changes - Data Product resource tests are failing with assertion errors. This PR modifies pipeline status APIs, not Data Product functionality.

Details

Failed tests: All in DataProductResourceTest

  • testDataProductBulkOutputPorts - expected: success but was: failure
  • testGetPortsByNameEndpoints - expected: 1 but was: 0
  • testGetPortsViewEndpoint - expected: 1 but was: 0
  • testGetOutputPortsReturnsFullEntities - expected: 1 but was: 0
  • testDataProductDomainMigrationWithInputOutputPorts - Output port should be in target domain after migration (domain UUID mismatch)
  • testDataProductBulkPortsViaApi - test error

Test results:

  • Job 62999935598 (maven-sonarcloud-ci)
  • 7919 tests run, 5 failures, 1 error, 701 skipped
  • Duration: ~3.5 hours

Why unrelated to PR:

  1. This PR changes pipeline status ingestion APIs (Java backend + Python connectors)
  2. Modified files: PipelineRepository.java, PipelineResource.java, databrickspipeline/metadata.py, pipeline_status.py, SQL migrations
  3. No changes to Data Product functionality - These are separate domains
  4. Test failures are in Data Product bulk operations, port endpoints, and domain migration
  5. The PR doesn't touch Data Product repository, resource classes, or domain migration logic
  6. All failures show business logic assertion mismatches (status codes, counts, domain IDs), not compilation or integration issues

Code Review: ⚠️ Changes requested (7 resolved / 13 findings)

Bulk pipeline status API is well-structured with good test coverage. All 6 previous findings remain unresolved — most notably the missing reindexAcrossIndices call in the bulk path and the MySQL migration data loss risk for rows with missing $.timestamp.

⚠️ Edge Case: Databricks connector break assumes descending run order

📄 ingestion/src/metadata/ingestion/source/pipeline/databrickspipeline/metadata.py:268-269

The yield_pipeline_status method uses break when it encounters a run with start_time < cutoff_ts. This assumes that self.client.get_job_runs() returns runs in strictly descending chronological order (newest first).

If the Databricks API returns runs in a different order (e.g., ascending, or unordered), this break will cause the method to skip recent runs that appear after an older one in the list. This would result in incomplete pipeline status data being ingested.

Suggestion: Either:

  1. Verify and document that get_job_runs() guarantees descending order (Databricks Jobs API does return runs in descending order by default when using list_runs)
  2. Use continue instead of break to safely handle all runs regardless of ordering, though this loses the early-termination optimization
  3. Add a sort before iterating: sorted(runs, key=lambda r: r.get('start_time', 0), reverse=True)
if run.start_time and run.start_time < cutoff_ts:
    continue  # safer: skip old runs without assuming order
⚠️ Bug: MySQL migration: data loss risk during column drop/re-add

📄 bootstrap/sql/migrations/native/1.11.8/mysql/schemaChanges.sql:32-36

The MySQL migration drops the timestamp column and re-adds it in a single ALTER TABLE statement. While MySQL can handle this atomically within one DDL statement, dropping a VIRTUAL generated column and re-adding it as STORED forces a full table rebuild on entity_extension_time_series. On large deployments, this table can contain millions of rows (pipeline statuses, test results, profiler data, etc.), which means:

  1. Table lock duration: The ALTER TABLE will hold a metadata lock for the entire rebuild, blocking all reads/writes to this table.
  2. Downtime risk: Any concurrent pipeline status writes or reads during migration will be blocked.

Consider using pt-online-schema-change or gh-ost for large deployments, or at minimum, document the expected downtime/lock duration in the migration comments so operators can plan accordingly. Also consider adding a note about the migration's impact in the PR description since this is a high-traffic table.

⚠️ Edge Case: Bulk DB insert not atomic with validation/ES indexing

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:419

In addBulkPipelineStatus, the flow is:

  1. Validate tasks (lines 389-396)
  2. Bulk upsert to DB (line 405)
  3. Bulk index to ES (line 407)
  4. Store RDF (line 409)
  5. Update pipeline entity index (line 419)

If step 3, 4, or 5 throws an exception, the DB records from step 2 have already been committed (JDBI useHandle auto-commits). Unlike the single-status path which wraps ES indexing in a try-catch (lines 360-365), the bulk path's bulkIndexPipelineExecutions already has its own try-catch internally, which is good. However, the searchRepository.updateEntityIndex(pipeline) call at line 419 is NOT wrapped in a try-catch. If it throws, the response will be an error even though the DB upsert succeeded, leaving the client uncertain about the operation's status.

Suggested fix: Wrap the entity index update in a try-catch to match the pattern used for ES execution indexing:

try {
    searchRepository.updateEntityIndex(pipeline);
} catch (Exception e) {
    LOG.error("Failed to update pipeline entity index in Elasticsearch", e);
}
⚠️ Bug: Bulk status path missing reindexAcrossIndices call

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:419-423 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:419

The single-status addPipelineStatus method (line 371) calls reindexAcrossIndices("upstreamLineage.pipeline.fullyQualifiedName", ...) to update downstream lineage entities in the search index when a pipeline's status changes. The bulk path addBulkPipelineStatus (lines 419-423) only calls searchRepository.updateEntityIndex(pipeline) but omits the reindexAcrossIndices call entirely.

This means entities downstream of the pipeline (e.g., tables with lineage from this pipeline) won't have their search index updated when bulk statuses are ingested, causing stale upstreamLineage data in search results.

Suggested fix: Add the same reindexAcrossIndices call after searchRepository.updateEntityIndex(pipeline):

try {
    searchRepository.updateEntityIndex(pipeline);
    searchRepository.getSearchClient()
        .reindexAcrossIndices(
            "upstreamLineage.pipeline.fullyQualifiedName", pipeline.getEntityReference());
} catch (Exception e) {
    LOG.error("Failed to update pipeline entity index in Elasticsearch", e);
}
💡 Quality: Redundant pipeline lookup when statuses list is empty

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:377-384

In addBulkPipelineStatus, when the list is null/empty, the method fetches the pipeline and returns it (lines 377-380). Then if the list is non-empty, it fetches the pipeline again (lines 383-384). This means in the normal (non-empty) path, the pipeline is fetched once, but the empty-list guard creates an unnecessary code path that returns ENTITY_UPDATED even though nothing was updated.

Consider either:

  1. Returning Response.Status.NO_CONTENT for empty lists, or
  2. Moving the pipeline fetch before the check and reusing it:
Pipeline pipeline = daoCollection.pipelineDAO().findEntityByName(fqn);
pipeline.setService(getContainer(pipeline.getId()));
if (pipelineStatuses == null || pipelineStatuses.isEmpty()) {
    return new RestUtil.PutResponse<>(Response.Status.OK, pipeline, ENTITY_NO_CHANGE);
}
💡 Edge Case: Bulk status doesn't check if latest exceeds existing status

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:395-400

The addBulkPipelineStatus method finds the latest status among the submitted batch and sets it as the pipeline's pipelineStatus. However, it doesn't check whether the pipeline already has a more recent status from a previous call. If someone submits a batch of older statuses after a newer one was already set, the pipeline's pipelineStatus would regress to an older timestamp.

For comparison, the existing single-status addPipelineStatus also doesn't guard against this, so this may be by-design. But with bulk operations, the risk is higher since a batch of historical data is a common use case.

Suggestion: Consider comparing the batch's latest timestamp against the pipeline's current pipelineStatus timestamp before overwriting.

✅ 7 resolved
Edge Case: Bulk task validation NPE when taskStatus is null

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:390-395
In addBulkPipelineStatus, the inner loop iterates over pipelineStatus.getTaskStatus() without null checking. If a PipelineStatus is submitted with no taskStatus field (which is valid per the schema — the field is optional), getTaskStatus() may return null, causing a NullPointerException.

The single-status path at line 326 has the same pattern, so this is a pre-existing pattern, but introducing it in a bulk API amplifies the impact — a single null taskStatus in a batch of 1000 would fail the entire batch after potentially partial DB writes.

Suggested fix:

for (PipelineStatus pipelineStatus : pipelineStatuses) {
    for (Status taskStatus : listOrEmpty(pipelineStatus.getTaskStatus())) {
        if (validatedTasks.add(taskStatus.getName())) {
            validateTask(pipeline, taskStatus.getName());
        }
    }
}

The listOrEmpty utility is already imported in this file.

Bug: Generated TS uses numberOfStatus but schema defines statusLookbackDays

📄 openmetadata-spec/src/main/resources/json/schema/metadataIngestion/pipelineServiceMetadataPipeline.json:85-90 📄 openmetadata-ui/src/main/resources/ui/src/generated/metadataIngestion/pipelineServiceMetadataPipeline.ts:47-51
The JSON schema at pipelineServiceMetadataPipeline.json defines the field as statusLookbackDays, and the Python Databricks connector correctly references self.source_config.statusLookbackDays. However, all four generated TypeScript files define the property as numberOfStatus instead:

  • createIngestionPipeline.ts line 634
  • ingestionPipeline.ts line 1263
  • pipelineServiceMetadataPipeline.ts line 51
  • workflow.ts line 5115

This means the frontend TypeScript code would use numberOfStatus while the backend Python code uses statusLookbackDays, creating a naming mismatch. The TS codegen appears to have been run against a different version of the schema (possibly before the rename from numberOfStatus to statusLookbackDays).

Fix: Regenerate the TypeScript types from the current JSON schema so the field is named statusLookbackDays consistently across all layers.

Quality: Dead code: EntityTimeSeriesDAO upsert/upsertBatch added but unused

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityTimeSeriesDAO.java:92-97
The EntityTimeSeriesDAO interface gains three new methods: upsert(table, ...), upsert(entityFQN, ...), and upsertBatch(...). However, none of these are used anywhere in the codebase. The PipelineRepository.bulkUpsertPipelineStatuses method uses raw JDBI batch operations with Entity.getJdbi().useHandle(...) instead of going through the DAO.

This creates two issues:

  1. Dead code that adds maintenance burden
  2. Two competing approaches to the same operation (DAO method vs. raw JDBI)

Suggestion: Either remove the unused DAO methods, or refactor PipelineRepository to use them. Note that upsertBatch currently iterates one-by-one (defeating the batch purpose), so if you keep it, consider making it truly batch.

Edge Case: Bulk API endpoint has no payload size limit

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/pipelines/PipelineResource.java:523
The addBulkPipelineStatus REST endpoint at /{fqn}/status/bulk accepts a List<PipelineStatus> with no upper bound on the number of items. A caller could submit thousands or millions of status records in a single request, which would:

  1. Cause an oversized JDBI batch operation that may exhaust memory or hit DB timeout limits
  2. Create a proportionally large Elasticsearch bulk index request
  3. Block the request thread for an extended period

Consider adding a validation check at the API layer to cap the list size (e.g., 1000 items), or at minimum add a @Size(max=N) annotation on the parameter:

@Valid @Size(max = 1000) List<PipelineStatus> pipelineStatuses

Alternatively, enforce it in PipelineRepository.addBulkPipelineStatus with an early validation check.

Quality: Misleading config name: numberOfStatus actually means days

📄 openmetadata-spec/src/main/resources/json/schema/metadataIngestion/pipelineServiceMetadataPipeline.json:85-90
The JSON schema field numberOfStatus has the description "Number of days of pipeline run status history to ingest" — it represents a number of days, not a number of statuses. The name is misleading and may confuse users into thinking it controls the count of status records to ingest.

In the Databricks connector, it's used as:

lookback_days = self.source_config.numberOfStatus or 1
cutoff_ts = int((datetime.now(timezone.utc) - timedelta(days=lookback_days)).timestamp() * 1000)

Suggestion: Rename to numberOfDays or statusLookbackDays to accurately reflect the semantics. If numberOfStatus is an existing field being repurposed, document the semantic change.

...and 2 more resolved from earlier reviews


sonarqubecloud bot commented Feb 9, 2026

@ulixius9 ulixius9 merged commit b244798 into main Feb 10, 2026
33 of 37 checks passed
@ulixius9 ulixius9 deleted the batch_pipeline_status branch February 10, 2026 12:44
@github-actions

Failed to cherry-pick changes to the 1.11.9 branch.
Please cherry-pick the changes manually.
You can find more details here.

ulixius9 pushed a commit that referenced this pull request Feb 10, 2026
* Add bulk apis for pipeline status

* Update generated TypeScript types

* Fix gitar comments

* Update generated TypeScript types

* Fix pycheck

* Address comments

* Fix databricks test

* Move schema changes to 1.11.9

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: harshsoni2024 <harshsoni2024@gmail.com>
manerow added a commit that referenced this pull request Feb 10, 2026

Labels

  • backend
  • safe to test: Add this label to run secure Github workflows on PRs
  • To release: Will cherry-pick this PR into the release branch
